
All Questions

0 votes
1 answer
287 views

Aren't balanced data sets important in regression?

Why is the need for balanced data sets (almost) always mentioned only in the context of classification and not of regression?
Tfovid's user avatar
1 vote
0 answers
91 views

Is it the right method to remove instances that are hard to predict before the train/test split?

In a binary classification problem, I have a slightly unbalanced medical dataset with class distribution 0: 5600, 1: 1500 (0 = without a problem, 1 = with a problem). I tried many pipelines, AutoML tools, and ...
DOT's user avatar
0 votes
0 answers
287 views

Train/test split on a small dataset along with SMOTE

I have an imbalanced binary classification dataset with 1000 samples (15% in class 1, 85% in the rest). My main goal is to build a robust classifier using the following approach. Wanted to know if ...
Vardaan Khanted's user avatar
1 vote
1 answer
975 views

Test set larger than train set [closed]

There is a two-class dataset with 1121 values in total, 230 from one class and 891 from the other. The training set is chosen as 230+230=460 from both classes and the test set as the ...
Jean's user avatar
1 vote
1 answer
30 views

Many questions about training on unbalanced and duplicated data

I'm a DS student. I have about 30,000 bank statements, all labeled with a specific category (cat1, cat2, ...). With that data I'm trying to train a classification model, but I found several problems: ...
Jack Fenn's user avatar
-1 votes
1 answer
80 views

Using average precision as a metric for an imbalanced problem (learning curve example) [closed]

I have an imbalanced problem (2% target class) and therefore need an appropriate metric - so I chose average_precision. My code: ...
mathella's user avatar
1 vote
1 answer
43 views

[under/over]-sampling teaches model the wrong distribution?

TLDR: Will under/oversampling during the training phase teach the model the wrong distribution and adversely affect accuracy? Let us assume you want to train a classifier to differentiate between ...
Stephen Lasky's user avatar
3 votes
1 answer
2k views

While downsampling training data, should we also downsample the validation data or retain the validation split as it is?

I am dealing with a class imbalance problem. In this case, I am downsampling the majority class labels in the training set. Among training, validation and test splits, the majority class in training ...
Ashwin Geet D'Sa's user avatar
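
A minimal sketch of the setup described in the excerpt above, where only the training split is downsampled and the validation split keeps its original class ratio. The helper name `downsample_majority` and the toy label lists are hypothetical and are not taken from the question; this illustrates the mechanics only and does not settle what the question asks.

```python
import random
from collections import Counter

def downsample_majority(labels, seed=0):
    """Return sorted row indices with every class cut to the smallest class size."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    # Size of the rarest class; larger classes are randomly reduced to this.
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))  # sampling without replacement
    return sorted(kept)

# Hypothetical toy splits: 90% majority class (0), 10% minority class (1).
train_labels = [0] * 900 + [1] * 100
val_labels = [0] * 90 + [1] * 10

train_idx = downsample_majority(train_labels)
balanced_train = [train_labels[i] for i in train_idx]

print(Counter(balanced_train))  # training split after downsampling
print(Counter(val_labels))      # validation split, left untouched
```

The same indices in `train_idx` would be used to select the feature rows, so that features and labels stay aligned.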
0 votes
1 answer
1k views

Splitting float values into train and test sets with train_test_split?

How do I split float values into train and test sets with train_test_split? I used LabelEncoder but I have about 300K lines and when I used the cross_val I saw ...
user10296606's user avatar
6 votes
2 answers
6k views

Resampling for imbalanced datasets: should testing set also be resampled?

Apologies for what is probably a basic question but I have not been able to find a definitive answer either in the literature or on the Internet. When dealing with an imbalanced dataset one possible ...
Jose Manuel Albornoz's user avatar
6 votes
2 answers
595 views

Why does the real-world output of my classifier have a label ratio similar to the training data?

I trained a neural network on a balanced dataset, and it has good accuracy (~85%). But in the real world positives appear in about 10% of the cases or less. When I test the network on a set with real world ...
Bien's user avatar
2 votes
2 answers
123 views

oversampling data with subclass

Oversampling of under-represented data is a way to combat class imbalance. For example, if we have a training data set with 100 data points of class A and 1000 data points of class B, we can over ...
chaohuang's user avatar
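
The example in the excerpt above (100 points of class A, 1000 of class B) can be sketched as plain random oversampling with replacement. The function name `random_oversample` and the placeholder rows are assumptions made for illustration; the truncated question may go on to describe a different scheme, and this sketch does not handle the subclasses mentioned in its title.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        extra = rng.choices(pool, k=target - n)  # with replacement; empty when k == 0
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

# Toy data matching the sizes in the excerpt; the row values are placeholders.
rows = list(range(1100))
labels = ["A"] * 100 + ["B"] * 1000

res_rows, res_labels = random_oversample(rows, labels)
print(Counter(res_labels))
```

Every added row is an exact copy of an existing class A row, so the resampled set contains no new information, only a different class ratio.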
1 vote
3 answers
4k views

Downsampling and class ratios

My target variable is whether an application is accepted or not. It is a highly imbalanced target with 98.5% of applications accepted. I am unclear about the concept of downsampling. If I were to ...
Soorya Paturi's user avatar
7 votes
2 answers
3k views

How to fix class imbalance in training sample?

I was very recently asked in a job interview about solutions to fix an imbalance of classes in the training dataset. Let's focus on a binary classification case. I offered two solutions: oversampling ...
Learning is a mess's user avatar
